Abstract: This paper investigates a general framework to discover
categories of unlabeled scene images according to their appearances
(i.e., textures and structures). We jointly solve two coupled tasks in
an unsupervised manner: (i) classifying images without pre-determining
the number of categories, and (ii) pursuing a generative model for each
category. In our method, each image is represented by two types of
image descriptors that capture image appearances
from different aspects. By treating each image as a graph vertex, we
build a graph and pose image categorization as a graph
partition process. Specifically, a partitioned sub-graph can be regarded
as a category of scenes, and we define the probabilistic model of
graph partition by accumulating the generative models of all separated
categories. For efficient inference with the graph, we employ a stochastic
cluster sampling algorithm, which is designed based on the
Metropolis-Hastings mechanism. During the iterations of inference, the model of
each category is analytically updated by a generative learning algorithm.
In the experiments, our approach is validated on several challenging
databases, and it outperforms other popular state-of-the-art methods.
The implementation details and empirical analysis are presented as
well.

Index Terms: Unsupervised Categorization; Graph Partition;
Generative Learning; Scene Understanding

1 Introduction 

Category discovery for unlabeled images is an important 
research topic with a wide range of applications such 
as content-based image retrieval [1], [2], image database 
management [3], [4], and scene understanding [5], [6], 
[7]. In this paper, we develop a unified framework to
categorize scene images in an unsupervised manner.
Specifically, with this framework, a batch of unlabeled
scene images can be automatically grouped into different
categories according to their contents, and we simultaneously
generate the probability models for the categories.

We pose unsupervised image categorization as
a graph partition task, i.e., each generated partition
indicates a potential category; then we employ a novel
cluster sampling algorithm for inference, which is
an extension of Swendsen-Wang cuts [32] that greatly
improves the inference efficiency. More specifically,
the graph partition is formulated under a probabilistic
framework that accumulates the generative models of all
categories. Intuitively, the goodness of a partition is
determined by how well the learned models explain
or generate the partitioned categories. Therefore, solving
for the optimal graph partition is equivalent to searching for the
maximum probability.

Natural scenes usually contain diverse image contents
related to different types of visual appearance
patterns, e.g., inhomogeneous (or structural) textures
(buildings, cars, roads, etc.) and homogeneous textures
(grasses, water surfaces, etc.) [19]. Many studies [20],
[21], [22] on designing image features show that
distribution-based descriptors (e.g., SIFT [23], HOG [24],
and Textons [11]) and binary operators (e.g., LBP
and its variants [25], [26]) achieve state-of-the-art results in
representing low-level image contents from different aspects.
The former features tend to describe
inhomogeneous textures well, while the latter can be applied
to capture highly random textures [27]. Therefore, in
our method, we represent an image with a number of
image patches at multiple scales. Two effective image
features, the Histogram of Oriented Gradients (HOG) [24]
and the Center-Symmetric Local Binary Pattern (CS-LBP) [26],
are employed to describe the image patches.
Specifically, we define two types of visual words (i.e.,
inhomogeneous textural words and homogeneous textural
words), respectively, based on the two features. In the
literature, the significance of using combined features
is also demonstrated in various vision tasks, e.g.,
near-duplicate image retrieval [1], [2], object detection [35],
and video tracking [36], [37].

Fig. 1. An overview of our framework. We formulate
the problem of image category discovery as a graph
partition task. In the left panel, the images are treated
as graph vertices that are partitioned into subgraphs by
turning off the graph edges. As shown in the right panel,
the generative models for all partitioned categories are
pursued simultaneously, and the models are also used
to guide the inference of graph partition. The models are
learned with two types of visual words: inhomogeneous
textural words (ITWs) and homogeneous textural words
(HTWs) defined based on two image descriptors.

Moreover, we adaptively select informative features
(i.e., visual words) for each scene class, along with
the categorization procedure. Several methods of image
categorization [9], [10] show that different categories of
images are probably captured by different class-specific
features. Some discriminative learning algorithms (e.g.,
Adaboost [28] and SVM) perform very well in feature
selection. However, they are not suitable for our task,
since these algorithms rely on negative data and are
often sensitive to outliers. In contrast, our framework
employs a generative learning algorithm based on
information criteria [29], [30], so that we can quickly pursue the
generative models of categories without extra negative
data.

The framework of our approach is illustrated in Fig. 1.
The key contribution of this work is a general approach
for automatic scene image categorization, in which the
cluster (i.e., category) number is automatically
determined. The generative category models are learned and
updated together with the categorization
procedure. Our method is evaluated on several public
datasets and outperforms the state-of-the-art approaches.
It is worth mentioning that the graph partition and
category models are closely coupled. Given a state of
partition, we can learn (or update) the probability
models, while the category models can in turn drive the partition to
be refined.


1.1 Related Work 

Most of the methods of scene image categorization
involve a procedure of supervised learning, i.e., training
a multi-class predictor (classifier) with manually
labeled images [8]. Unsupervised image categorization is
often posed as clustering images into groups according
to their contents (i.e., appearances and/or structures). In
some traditional methods [9], various low-level features
(such as color, filter banks, and textons [11]) are first
extracted from images, and a clustering algorithm (e.g.,
k-means or spectral clustering) is then applied to discover
categories of the samples.

To handle diverse image content, some effective
image representations such as bag-of-words (BoWs) are
proposed [12], [13], and they represent an image by
using a pre-trained collection (i.e., dictionary) of visual
words. Furthermore, Lazebnik et al. [14] present a spatial
pyramid representation of BoWs by pooling words at
different image scales, and this representation effectively
improves results for scene categorization [15]. Farinella
et al. [16] propose to build an effective scene
representation based on constrained and compressed domains.

To exploit the latent semantic information of scene
categories, Bosch et al. [17] discuss the probabilistic Latent
Semantic Analysis (pLSA) model that can explain the
distribution of features in the image as a mixture of a few
"semantic topics". As an alternative model for capturing
latent semantics, the Latent Dirichlet Allocation (LDA)
model [18] has been widely used as well.

On the other hand, the category number is required to
be predetermined or be exhaustively selected in many
previous unsupervised categorization approaches [7],
[31]. In computer vision, stochastic sampling
algorithms [32], [33], [37] are shown to be capable of flexibly
generating new clusters, and merging and removing existing
clusters in a graph representation. Motivated by these
works, we propose to automatically determine the
number of image categories with stochastic sampling.

The rest of this paper is organized as follows. We first 
introduce the image representation in Section 2. Then 
we present the problem formulation in Section 3, and 
follow with a description of the inference algorithm for 
unsupervised image categorization in Section 4. Section 
5 discusses the learning algorithm for category model
pursuit during the inference procedure. The experimental
results and comparisons are presented in Section 6,
and the paper is concluded in Section 7.

2 Image Representation 

In this section, we start by briefly introducing the two 
effective low-level image descriptors used in this work, 
and define two types of visual words to construct the 
dictionary of images. 

Fig. 2. Image representation. We represent an image with
the pyramid Bag-of-Words (BoW) model with two types of
visual words that are, respectively, defined based on two
image descriptors, i.e., HOG [24] and CS-LBP [26].

Previous works on designing image features can be
roughly divided into two categories [27], [35]. The first
one explicitly describes images with local gradients that
are sensitive to structures (e.g., edges, boundaries, and
junctions) and distinct textures (e.g., regions of clear
details). The other one reflects uncertain differences among
pixels and thus tends to be suitable for incognizable
random textures (e.g., complex regions and cluttered
patterns). Thus, we utilize two typical image descriptors,
i.e., HOG [24] and CS-LBP [26], respectively, in this work.
Following the studies on image representation [27], we
refer to a visual word $\omega$ as an ensemble or equivalence
class of image patches that share similar appearances.
Letting $h(\cdot)$ be the histogram of an image feature, we
define $\omega$ as

$$\omega = \{\Lambda : h(\Lambda) = \bar{h} + \epsilon\}, \tag{1}$$

where $\bar{h}$ denotes the mean histogram of the image
patches, and $\epsilon$ is the statistical fluctuation, i.e., a very
small value. According to the two image descriptors,
we define two types of visual words: inhomogeneous
textural words (ITWs) and homogeneous textural words
(HTWs). The benefit
of combining the two types of words will be
demonstrated in the experiments.

To define ITWs, the input image domain is divided 
into a number of regular cells; at each pixel, a local 
gradient is calculated, and a histogram is pooled over 
each cell for different orientations. As illustrated in Fig. 2,
we decompose an image patch into 2 × 2 cells and quantize
the orientations into 8 angles. The dimension of this
descriptor, denoted as $h^a$, is thus 2 × 2 × 8 = 32.

The HTWs are generated using the CS-LBP operator,
which is computed at every pixel in the input image
domain. It compares center-symmetric pairs of pixels around the given
pixel and forms a binary vector. Given a pixel located at
$x$ with $n = 8$ neighboring pixels that are equally spaced
on a circle of a given radius, as in the example illustrated in Fig. 2,
the binary vector can be calculated as

$$b(x) = \sum_{i=0}^{n/2-1} \varsigma(n_i - n_{i+n/2})\, 2^i, \qquad \varsigma(x) = \begin{cases} 1, & x > 1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $n_i$ and $n_{i+n/2}$ correspond to the intensities
of center-symmetric pairs of pixels. We compute the
operator over all pixels in the domain; the obtained
binary vectors can be converted into decimal strengths
in the range of [0, 15]. An example of a strength map is
shown in Fig. 2. Since the patch is divided into 4 cells, we further
pool the strengths into a histogram with 16 × 4 = 64 bins,
denoted as $h^b$.
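
To make the CS-LBP pooling concrete, the following is a minimal NumPy sketch of Equation (2) and the 64-bin histogram. It is an illustration under stated assumptions rather than the implementation used in the paper: the function names `cs_lbp_codes` and `cs_lbp_histogram` are hypothetical, the 8 neighbors are taken on a square ring instead of an interpolated circle, and border pixels without a full neighborhood are skipped.

```python
import numpy as np

def cs_lbp_codes(image, radius=1, threshold=1.0):
    """Return CS-LBP codes in [0, 15] for all interior pixels.

    Assumption: the 8 neighbors lie on the square ring at `radius`
    (no sub-pixel interpolation), giving 4 center-symmetric pairs.
    """
    img = np.asarray(image, dtype=np.float64)
    r = radius
    h, w = img.shape
    # Offsets (dy, dx) of n_0..n_3; the partner n_{i+4} is the opposite point.
    offsets = [(0, r), (r, r), (r, 0), (r, -r)]
    codes = np.zeros((h - 2 * r, w - 2 * r), dtype=np.int64)
    for i, (dy, dx) in enumerate(offsets):
        a = img[r + dy:h - r + dy, r + dx:w - r + dx]   # n_i
        b = img[r - dy:h - r - dy, r - dx:w - r - dx]   # n_{i+n/2}
        # Set bit i when the pair difference exceeds the threshold.
        codes += ((a - b) > threshold).astype(np.int64) << i
    return codes

def cs_lbp_histogram(image, cells=2, **kwargs):
    """Pool the codes into cells x cells histograms of 16 bins each."""
    codes = cs_lbp_codes(image, **kwargs)
    rows = np.array_split(np.arange(codes.shape[0]), cells)
    cols = np.array_split(np.arange(codes.shape[1]), cells)
    hists = [np.bincount(codes[np.ix_(rr, cc)].ravel(), minlength=16)
             for rr in rows for cc in cols]
    return np.concatenate(hists)
```

With `cells=2` the returned vector has 16 × 4 = 64 entries, matching the dimension of the histogram described above.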

Then we construct the dictionary to represent images
with the visual words. In our implementation, we collect
a large number of image patches from our database,
compute the two descriptors for each, and group
them into a batch of clusters (words) using the k-means
algorithm. Thus, we obtain a dictionary
$W = \{\omega_i, i = 1, \ldots, m\}$, where $\omega_i$ is a visual word (i.e., an ITW or an HTW).

Given an image $I$, we represent it in a spatial
pyramid format with $1 + 4 \times 2 = 9$ blocks, i.e., 3 scales
(resolutions) and 4 blocks in each scale except the top,
as illustrated in Fig. 2. In each block, the image domain
is further decomposed into regular image patches that
are mapped to the generated words. The image of a
block $J$ can thus be represented as a vector by using
the dictionary, $(r_1(J), r_2(J), \ldots, r_m(J))$, where $r_i(J)$ is
the response to the visual word $\omega_i$, and

$$r_i(J) = \psi\Big(\sum_{\Lambda \in J} 1_{\omega_i}(\Lambda)\Big), \tag{3}$$

where the indicator function $1_{\omega_i}(\Lambda) \in \{0, 1\}$
indicates whether the image patch $\Lambda \in J$ matches
$\omega_i$. The matching is measured by either of the two
descriptors, $h^a$ and $h^b$, according to the type of the word $\omega_i$.
Thus, $\sum_{\Lambda \in J} 1_{\omega_i}(\Lambda)$ is the number of patches in the
image block $J$ that match the visual word $\omega_i$. Here
$\psi(\cdot)$ is the sigmoid function, which is characterized by
a saturation level.

The image $I$ is hence represented as $\mathcal{R}(I)$, by
concatenating the vectors of all 9 blocks.

3 Problem Formulation 

Given a set of unlabeled images $\mathcal{D}$, the goal of our
framework is to categorize them into an unknown number $K$ of
disjoint clusters, as

$$\Pi = \{\pi_1, \pi_2, \ldots, \pi_K\}, \tag{4}$$

where $\cup_{k=1}^{K} \pi_k = \mathcal{D}$, and $\pi_i \cap \pi_j = \emptyset$, $\forall i \neq j$.

We first build a graph $G_0 = (V, E_0)$, in which
$V = \mathcal{D} = \{I_1, I_2, \ldots, I_N\}$ is the set of graph vertices specifying
the images to be categorized, and $E_0$ is the set of edges
connecting neighboring graph vertices. Then we solve
the task of graph partition by cutting edges of the graph,
i.e., generating disjoint subgraphs. However, $G_0$ is a fully
connected graph where the initial edge set $E_0$ could be
very large. To reduce computational complexity, we
compute a relatively sparse graph representation
$G = (V, E)$ by pruning edges, $E \subset E_0$.

For any edge $e \in E_0$, an auxiliary connecting
variable $\mu_e \in \{\text{on}, \text{off}\}$ is first introduced, which indicates
whether the edge is turned on or off. Then we can
define the edge connecting probability by measuring
the similarity of the two connected graph vertices. In our
implementation, we define the similarity using the
visual words $W$. Specifically, any vertex $v \in V$ is
represented as $\mathcal{R}(I) = (r_1(I), r_2(I), \ldots, r_m(I))$, where
$r_i(I)$ is the response to the word $\omega_i$, as in Equation (3).
Thus, we can define the connecting probability $q_e$ for two
arbitrary images $I_s \in V$, $I_t \in V$ as

$$q_e(s,t) = p(\mu_e = \text{on} \mid v_s, v_t) = \exp\{-\tau\,[\mathcal{KL}(\mathcal{R}_s \| \mathcal{R}_t)]\}, \tag{5}$$

where we denote $\mathcal{R}_s = \mathcal{R}(I_s)$ and $\mathcal{R}_t = \mathcal{R}(I_t)$ for
notational simplicity. $\mathcal{KL}(\cdot)$ is the symmetric
Kullback-Leibler distance between two feature vectors, and $\tau$ is
a constant parameter. $q_e(s,t)$ should be close to 0 if $I_s$
and $I_t$ naturally belong to different categories; the edge $e$
connecting $I_s$ and $I_t$ would then be turned off with high
probability.
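
As an illustration of Equation (5), the sketch below computes the turn-on probability of an edge from two response vectors. Several details are assumptions not stated in the text: the vectors are normalized to sum to one before comparison, a small constant `eps` avoids taking the logarithm of zero, and the symmetric Kullback-Leibler distance is taken as the sum of the two directed divergences. The function names are hypothetical.

```python
import numpy as np

def symmetric_kl(r_s, r_t, eps=1e-10):
    """Symmetric KL distance, assumed here to be KL(p||q) + KL(q||p)."""
    p = np.asarray(r_s, dtype=np.float64) + eps
    q = np.asarray(r_t, dtype=np.float64) + eps
    p = p / p.sum()   # normalize to a distribution (assumption)
    q = q / q.sum()
    # sum (p - q) * log(p / q) equals KL(p||q) + KL(q||p)
    return float(np.sum((p - q) * np.log(p / q)))

def edge_on_probability(r_s, r_t, tau=0.2):
    """Turn-on probability q_e = exp(-tau * KL_sym), as in Equation (5)."""
    return float(np.exp(-tau * symmetric_kl(r_s, r_t)))
```

Identical response vectors give a probability of 1, and the value decays toward 0 as the two vectors diverge; the default `tau=0.2` is the setting reported in the experiments.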

In practice, the edges with very low turn-on
probability can be directly removed. Furthermore, we enforce
that each vertex is connected to at most 6 neighbors.
That is, for any vertex we keep the 6 edges with the highest
connecting probabilities, and remove the other edges.
Therefore, we obtain the sparse graph $G = (V, E)$, where
$E \subset E_0$.
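
One possible way to carry out this pruning is sketched below. The text does not say how conflicts are resolved when one vertex ranks another among its top 6 but not vice versa; this sketch assumes a mutual rule (an edge survives only if each endpoint ranks the other among its top neighbors), which guarantees the degree bound. The function name `sparsify_edges` and the cutoff `min_prob` are hypothetical.

```python
import numpy as np

def sparsify_edges(q, max_neighbors=6, min_prob=1e-3):
    """Prune a dense matrix of turn-on probabilities to a sparse edge set.

    q: symmetric (N, N) array of turn-on probabilities q_e.
    Returns a set of undirected edges (s, t) with s < t.
    Assumption: an edge is kept only if both endpoints select each other.
    """
    q = np.asarray(q, dtype=np.float64)
    n = q.shape[0]
    top = []
    for s in range(n):
        # Rank the other vertices by decreasing turn-on probability.
        ranked = [int(t) for t in np.argsort(-q[s]) if t != s]
        top.append({t for t in ranked[:max_neighbors] if q[s, t] >= min_prob})
    return {(s, t) for s in range(n) for t in top[s] if s < t and s in top[t]}
```

A looser alternative would keep the union of the per-vertex selections, at the cost of letting some vertices exceed 6 neighbors.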

With the graph representation, we pursue the
generative probability models for all categories, as

$$\Phi = \{\phi_k(I; W_k, \Theta_k),\; W_k \subset W,\; k = 1, \ldots, K\}, \tag{6}$$

where $W_k \subset W$ denotes the selected visual words for
modeling the category $\pi_k$, and $\Theta_k$ includes the
corresponding model parameters, i.e., the coefficients of
words. The overall solution of image category discovery
can be defined as

$$S = (K, \Pi, \Phi), \tag{7}$$

where $K$ is the inferred category number. The graph
partition $\Pi$ and category modeling $\Phi$ can be solved
together in a Bayesian inference framework. Assume
that $p(S)$ and $p(\mathcal{D}|S)$ denote the prior model and the
likelihood model, respectively. $p(S)$ can be simply
modeled by an exponential function of $K$,
as we impose no priors on $\Pi$ and $\Phi$. The likelihood
model $p(\mathcal{D}|S) = p(\mathcal{D}|\Pi, \Phi)$ can be defined as a product
of the generative models of all separated categories, as we
assume the models are generated independently of each
other. We can then define the posterior probability of
solution $S$ as

$$p(S|\mathcal{D}) \propto p(S)\,p(\mathcal{D}|S) = \exp\{-\beta K\} \prod_{k=1}^{K} \phi_k(W_k, \Theta_k), \tag{8}$$

where $\beta$ is an empirical parameter for constraining
the number of inferred categories. The category model
$\phi_k(W_k, \Theta_k)$ is defined on the probabilistic distribution of
the images in partition $\pi_k$. The models for all categories
can be learned and updated during the procedure of
image categorization.
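
In log form, Equation (8) is a penalty of $\beta$ per category plus the sum of the per-category log-likelihoods. The short sketch below shows this trade-off; the function name and the way the per-category log-likelihoods are supplied are assumptions made for illustration.

```python
def log_posterior(category_log_likelihoods, beta=300.0):
    """Unnormalized log-posterior of Equation (8).

    category_log_likelihoods: one value log phi_k per category (assumed
    to be computed elsewhere from the learned category models).
    """
    k = len(category_log_likelihoods)
    # log p(S|D) = -beta * K + sum_k log phi_k, up to an additive constant
    return -beta * k + sum(category_log_likelihoods)
```

Under this form, splitting one category into two raises the posterior only when the total log-likelihood improves by more than $\beta$, which is how the prior constrains the number of categories.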

4 Inference for Image Categorization 

The objective of inference is to search for the optimal
solution $S^*$ by maximizing the posterior probability in
Equation (8),

$$S^* = \arg\max p(S|\mathcal{D}). \tag{9}$$

This optimization is very challenging due to two
characteristics of our problem: (i) the unknown number of
partitions, and (ii) no confident initialization, i.e., the lack of
initial category models. Therefore, we employ a
stochastic sampling algorithm instead of deterministic
inference algorithms.

In the research area of stochastic inference, cluster
sampling, which is designed under the
Metropolis-Hastings mechanism, is very powerful for simulating
Ising/Potts graphical models. Recently, Barbu
and Zhu [32] generalized the algorithm, namely
Swendsen-Wang cuts (SWC), to solve graph partition
in several vision applications. This algorithm enables
us to effectively search for the maximum of the posterior
probability. It simulates a Markov chain containing a
sequence of states in the solution space $\Pi$ and visits the
Markov chain by realizing a reversible jump between
any two successive states.

In the following, we first introduce the SWC
algorithm, and then discuss an extension [34] that greatly
improves the inference efficiency. In general, the SWC
algorithm iterates in two steps:

1) Generate the connected components (CPs) by
probabilistically turning off connecting edges in the
graph. Graph vertices connected together by "on"
edges form a connected component (denoted by
CP for simplicity). Specifically, any two vertices
in one CP are linked by a path that consists of
several edges. For an arbitrary edge $e \in E$, we sample
its connecting variable $\mu_e$ and decide whether it is turned
on or off in this step. Then we obtain a few CPs,
each of which is a set of connected graph vertices.

2) Explore a new partition solution by relabeling one
of the CPs. Assume that the current partition
solution is $S_A$ and we are exploring a new solution $S_B$.
Given one randomly selected CP, reversible
operators are developed to re-assign its label. For
example, the selected CP can be merged into an
existing category by receiving the same label
as that category; otherwise, a new category can
be created if the selected CP receives a new label.
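
The first step above can be sketched as follows. The sketch follows the usual Swendsen-Wang convention, which is an assumption here because the text does not spell it out: an edge between vertices that currently carry different labels is always off, and an edge inside a category stays on with probability $q_e$. Components are collected with a simple union-find; all names are hypothetical.

```python
import random

def sample_connected_components(n_vertices, edges, labels, q, rng=random):
    """Sample connecting variables and return the resulting CPs.

    edges:  iterable of (s, t) pairs from the sparse graph G
    labels: current category label of each vertex
    q:      dict mapping each edge (s, t) to its turn-on probability
    """
    parent = list(range(n_vertices))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (s, t) in edges:
        # Assumption: edges across categories are off deterministically.
        if labels[s] == labels[t] and rng.random() < q[(s, t)]:
            parent[find(s)] = find(t)

    components = {}
    for v in range(n_vertices):
        components.setdefault(find(v), []).append(v)
    return list(components.values())
```

Each returned list is one CP; the second step then picks a CP and proposes a new label for all of its vertices at once.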

Fig. 3. Illustration of the compositional Swendsen-Wang cuts algorithm for exploring a new solution state.

We design the algorithm by the Metropolis-Hastings
mechanism [32]. Let $Q(S_A \to S_B)$ be the proposal
probability for moving from state $S_A$ to state $S_B$, and
conversely, let $Q(S_B \to S_A)$ be the proposal probability from
$S_B$ to $S_A$. The acceptance rate of the move from $S_A$ to
$S_B$ is

$$\alpha(S_A \to S_B) = \min\Big(1, \frac{Q(S_B \to S_A)}{Q(S_A \to S_B)} \cdot \frac{p(S_B|\mathcal{D})}{p(S_A|\mathcal{D})}\Big). \tag{10}$$

For any state transition, the proposal probability usually
involves two aspects: (i) the generation of the CP, and (ii)
the label assignment of the CP. In our method, the label of the
CP is assigned randomly with a uniform distribution,
so that the proposal probability can be simplified. Thus,
the ratio of proposal probabilities is calculated by

$$\frac{Q(S_B \to S_A)}{Q(S_A \to S_B)} = \frac{\prod_{e \in C_B}(1 - q_e)}{\prod_{e \in C_A}(1 - q_e)}, \tag{11}$$

where $C_A$ denotes the set of edges that are
probabilistically turned off for generating the CP on state $S_A$,
and similarly $C_B$ is the turned-off edge set on $S_B$. Here
we name $C_A$ or $C_B$ as a "cut", following [32].
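
The acceptance test of Equations (10) and (11) can be evaluated in the log domain to avoid underflow when many edges are involved. The sketch below assumes every $q_e$ on a cut is strictly below 1 and that the log-posteriors of the two states are computed elsewhere; the function name is hypothetical.

```python
import math

def acceptance_rate(cut_a, cut_b, q, log_post_a, log_post_b):
    """Metropolis-Hastings acceptance rate for the move S_A -> S_B.

    cut_a, cut_b: edges turned off around the CP in states S_A and S_B
    q:            dict mapping each edge to q_e (assumed < 1 on the cuts)
    log_post_*:   log p(S|D) of the two states, up to a shared constant
    """
    # log of the proposal ratio in Equation (11)
    log_ratio = (sum(math.log1p(-q[e]) for e in cut_b)
                 - sum(math.log1p(-q[e]) for e in cut_a))
    # add the log of the posterior ratio from Equation (10)
    log_ratio += log_post_b - log_post_a
    return math.exp(min(log_ratio, 0.0))   # equals min(1, ratio)
```

A proposed state would then be accepted when a uniform random number in [0, 1) falls below the returned value.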

To further accelerate the convergence of inference, we
employ an improved version of the SWC algorithm that
was originally proposed by us for video shot
categorization [34]. In the original algorithm, only one CP is
selected and processed in each step of solution exploration.
In our method, we process a number of CPs together
by coupling them into a combinatorial cluster. We thus
refer to this algorithm as the compositional SWC (CSWC).
The CSWC algorithm is able to enlarge the search
scope during the sampling iterations, resulting in faster
convergence than the original version.

Fig. 3 illustrates the idea of CSWC. Given a current
state $S_A$ (as shown in Fig. 3(a)), we can generate a
number of CPs by turning off a few edges (as shown
in Fig. 3(b)). Then we construct a higher-layer graph $\mathcal{G}$
based on these CPs. In this graph, we treat each CP
as a vertex, and link any two neighboring CPs by an
edge, as shown in Fig. 3(c). Within $\mathcal{G}$, we can generate the
combinatorial cluster, where several CPs are selected.

Similar to the definitions in $G$, we calculate the
turn-on probability $q_{CP}$ for an edge in $\mathcal{G}$ according to the
similarity of the two connected vertices (i.e., CPs), which
can be derived from the original graph $G$. Specifically,
given two neighboring $CP_i$ and $CP_j$, we measure their
similarity by aggregating all the edges in $G$ that connect
the vertices belonging to $CP_i$ and $CP_j$, respectively.
Thus, we define the edge probability in $\mathcal{G}$ as

$$q_{CP} \propto \Big[1 - \prod_{e = \langle s,t \rangle,\; s \in CP_i,\; t \in CP_j} (1 - q_e)\Big]. \tag{12}$$

By probabilistically turning off the edges in $\mathcal{G}$, we
can also generate several connected components, and
we regard them as combinatorial clusters to distinguish
them from the CPs in $G$. In Fig. 3(d), 4 combinatorial clusters are
generated. Different from the algorithm in [34], we allow
more than one combinatorial cluster to be selected in
this step, and we assign labels to them. In this
way, we generate a new solution of graph partition
accordingly. In the implementation, we enforce that each
combinatorial cluster is processed as an atomic unit,
i.e., all original CPs in the combinatorial cluster will
receive the same label. As Fig. 3 illustrates, to go from $S_A$
to $S_B$, the original SWC algorithm needs at least three
steps, whereas CSWC needs only one step. Note
that we visualize only one selected CP in Fig. 3(d) for
illustration.
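
The edge probability between two CPs in Equation (12) can be computed directly from the turn-on probabilities of the original edges that run between them. The sketch below treats the proportionality constant as 1, which is an assumption, and uses a hypothetical function name.

```python
def cp_edge_probability(cp_i, cp_j, q):
    """Turn-on probability between two CPs, following Equation (12).

    cp_i, cp_j: lists of vertex indices belonging to the two CPs
    q:          dict mapping edges (s, t) with s < t to q_e in the sparse graph
    """
    prod_off = 1.0
    for s in cp_i:
        for t in cp_j:
            e = (s, t) if s < t else (t, s)
            if e in q:                      # only edges present in the graph
                prod_off *= 1.0 - q[e]
    # One minus the probability that every connecting edge is off
    # (proportionality constant taken as 1, an assumption).
    return 1.0 - prod_off
```

Two CPs joined by several strong edges therefore obtain a turn-on probability close to 1, so they tend to be coupled into the same combinatorial cluster.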

During the inference, the posterior probability $p(S|\mathcal{D})$
changes, as we keep the category models updated
with the categorization operation. Note that we only
need to update the models of the categories to which
images are added or from which images are removed. We will introduce
the category model learning in the next section.

5 Category Model Learning 

Given a fixed graph partition $\Pi$, we learn the probability
model $\phi_k(W_k, \Theta_k)$ for each category by selecting the
most informative visual words. Since all scene images
in $\mathcal{D}$ are unlabeled, and no extra negative samples are
provided, we employ an efficient generative learning
algorithm for this task, namely information pursuit [29],
[38]. Similar approaches that combine generative
learning with unsupervised categorization are discussed in [10].

Algorithm 1: The sketch of our approach

Input: Image dataset $\mathcal{D} = \{I_1, \ldots, I_N\}$, and visual
words $W = \{\omega_1, \ldots, \omega_m\}$

Output: The categorization solution $S = (K, \Pi, \Phi)$

1. Initialization:

(1) Represent each image $I_i$ with the visual
words, $\mathcal{R}(I_i) = (r_1(I_i), \ldots, r_m(I_i))$.

(2) Create the graph $G_0 = (V, E_0)$, and compute
the turn-on probability $q_e$ according to Equation (5),
$\forall e \in E_0$.

(3) Remove the edges with low turn-on
probability deterministically, and generate the
sparse graph $G = (V, E)$.

2. Repeat for cluster sampling:

(1) At the current solution $S_A$, generate the CPs
by probabilistically turning off connecting edges in
the graph $G$.

(2) Construct a higher-layer graph $\mathcal{G}$ based on
the CPs.

(3) Generate combinatorial clusters by
probabilistically turning off edges in $\mathcal{G}$.

(4) Select several combinatorial clusters and
re-assign labels to them.

(5) Accept the new solution $S_B$ according to the
acceptance rate defined in Equation (10).

(6) Update the generative models, $\phi_k(I; W_k, \Theta_k)$,
for the categories that have been modified
according to solution $S_B$.

(7) Update the posterior probability $p(S|\mathcal{D})$
accordingly.

3. Output the final solution $S^* = \arg\max p(S|\mathcal{D})$.

Suppose the category $\pi_k$ is governed by an underlying
target model $\phi_k^f$. The model pursuit can be solved by
additively searching for a sequence of features, starting
from an initial model $\phi_{k,0}$. At each step $t$, the model
$\phi_{k,t}$ is updated to gradually approach $\phi_k^f$. Here
we omit $k$ for notational simplicity. In the manner of
stepwise pursuit, the new model $\phi_t$ is obtained by adding
a new feature $\omega_t$ to the current model $\phi_{t-1}$, and
$\omega_t$ imposes an additive constraint, as

$$\phi_t = \frac{1}{z_t}\,\phi_{t-1}\,e^{\lambda_t r_t}, \qquad \text{s.t. } E_{\phi_t}[r_t] = E_{\phi^f}[r_t], \tag{13}$$

where $r_t$ denotes the response of the word $\omega_t$. $E_{\phi^f}[r_t]$
represents the expectation of the feature over the
underlying model, which can be calculated by averaging
feature responses over positive samples. $E_{\phi_t}[r_t]$ denotes
the feature expectation on the new model. Following [29],
[38], we can derive the probability model by $T$ rounds
of model pursuit as the following Gibbs form,

$$\phi(I; W, \Theta) = \phi_0(I)\,\frac{1}{Z}\exp\Big\{\sum_{t=1}^{T} \lambda_t r_t(I)\Big\}, \tag{14}$$

where $Z = \prod_t z_t$ and $\Theta = (\lambda_1, \ldots, \lambda_T)$. $z_t$ normalizes
the sum of the probability to 1, and $\lambda_t$ is the coefficient
weight of the selected feature $\omega_t$. In our implementation,
we specify the initial model $\phi_0$ as a uniform distribution
over all words.

With the definition in Equation (14), the model is
updated by solving for $\lambda_t$ and $r_t$ at each round $t$. Here
we discuss a MaxMin-KL algorithm for this goal, which
iterates the following two steps.

Step 1: Max-KL. The most informative feature is
selected to update the current model. This step optimizes
the following problem, given the candidate features,

$$r_t^* = \arg\max_{r_t} \mathcal{K}(\phi_t \| \phi_{t-1}) = \arg\max_{r_t} \lambda_t E_{\phi^f}[r_t] - \log z_t. \tag{15}$$

This step could be computationally expensive, as we need
to sample the model distribution $\phi_{t-1}$ of the previous
round $t - 1$. Following recent works on image template
learning [38], [27], we can simplify the computation
by enforcing that the visual words have little overlap. In
particular, all features can be selected independently. The
optimization in Equation (15) can be approximated as

$$r_t^* = \arg\max_{r_t} E_{\phi^f}[r_t] - E_{\phi_0}[r_t], \tag{16}$$

where $E_{\phi_0}[r_t]$ can be ignored, as it is a constant
calculated on the initial model $\phi_0$. We calculate $E_{\phi^f}[r_t]$ by the
mean response values,

$$E_{\phi^f}[r_t] = \frac{1}{n_k} \sum_{i=1}^{n_k} r_t(I_i), \tag{17}$$

where $n_k$ is the number of images belonging to the
$k$-th category.

Step 2: Min-KL. Given the selected feature $r_t$, this
step computes its corresponding weight $\lambda_t$ and
normalization term $z_t$ by

$$\lambda_t^* = \arg\min_{\lambda_t} \mathcal{K}(\phi_t \| \phi_{t-1}), \qquad \text{s.t. } E_{\phi_t}[r_t] = E_{\phi^f}[r_t]. \tag{18}$$

This optimization in Equation (18) can be solved
analytically according to the proof in [27], and we obtain

$$\lambda_t = \log \frac{E_{\phi^f}[r_t]\,(1 - E_{\phi_0}[r_t])}{(1 - E_{\phi^f}[r_t])\,E_{\phi_0}[r_t]}, \qquad z_t = \exp\{\lambda_t\}\,E_{\phi_0}[r_t] + 1 - E_{\phi_0}[r_t]. \tag{19}$$

Since we can analytically pursue this model by
selecting a number $T$ of informative features, the model in
Equation (14) can be further simplified into the following
form,

$$\phi(I; \Theta) = \phi_0(I) \prod_{t=1}^{T} \frac{1}{z_t} \exp\{\lambda_t r_t(I)\}. \tag{20}$$


The algorithm proposed above is simple and
fast, because the values of $E_{\phi^f}[r_t]$ and $E_{\phi_0}[r_t]$ for each
feature only need to be computed once in the off-line
stage. Hence, we can embed the learning algorithm to
keep the category model updated during the iterative
procedure of categorization.

Algorithm 1 summarizes the overall sketch of our
framework.


6 Experiments 

In the experiments, we apply our method to discover 
categories for a batch of unlabeled images with diverse 
appearances, and compare with other state-of-the-art 
approaches. 

6.1 Datasets and Metrics 

We use three challenging public databases for validation:
MIT-Scene¹, Corel², and UIUC-Scene³. Moreover, these
three databases are mixed together as a larger testing set
for further evaluation.

The MIT-Scene database contains 2688 images
classified into 8 categories according to their meaningful
semantics: coasts, forest, mountains, country, highways,
city views, buildings, and streets. The number of images
in each category ranges from 260 to 410, and the
resolution of each image is 256 × 256 pixels. The Corel
dataset includes 1000 natural scenes with a resolution of
256 × 384 pixels in 10 semantic categories: bus, coasts,
dinosaurs, elephants, flower, food, horses, mountains,
people, and temples. Each category contains 100 images.
The UIUC-Scene database, which is an extension of
MIT-Scene, contains 4485 images classified into 15 categories
with various themes, e.g., mountains, forest,
offices, and living rooms. The mixed dataset is the union
of all three databases, with 5485 images in total
from 23 categories. Note that there are a few overlapping
categories among them.


TABLE 1

The inferred cluster number in each experimental run.

#      1    2    3    4    5    6    7    8    9    10
I      8    9    8    8    9    10   10   11   8    9
II     9    10   10   11   10   9    12   11   12   10
III    16   17   15   16   16   15   18   16   15   17
IV     27   24   24   26   24   25   24   26   25   25

#: No. of experiments;
I: Experiments on the MIT database;
II: Experiments on the Corel database;
III: Experiments on the UIUC database;
IV: Experiments on the mixed dataset.

1. http://people.csail.mit.edu/torralba/code/spatialenvelope/ 

2. http://wang.ist.psu.edu/docs/related.shtml

3. http://www-cvr.ai.uiuc.edu/ponce_grp/data/index.html 


The usual evaluation metric for categorization is
Average Precision, and the number of categories is assumed
to be predetermined. In this work, we adopt two
recently proposed metrics for unsupervised
categorization [7], [34], i.e., Purity and Conditional Entropy. In brief,
a larger value of Purity implies better performance
in categorization, whereas a smaller value of Conditional Entropy is better.

For the input set $\mathcal{D}$ of $N$ images,
suppose the underlying category number is $L$ and the
corresponding ground-truth category labels are denoted
by $X = \{x_i \in [1, L], i = 1, \ldots, N\}$. A testing system
groups the images into $K$ categories, $\{D_k, k = 1, \ldots, K\}$,
with the inferred category labels
$Y = \{y_i \in [1, K], i = 1, \ldots, N\}$. It is worth mentioning that $K$ may not be
equal to $L$, as we allow the algorithm to automatically
determine the number of categories. The metrics Purity
and Conditional Entropy are defined as

$$\mathrm{Purity}(X|Y) = \sum_{y \in Y} p(y) \max_{x \in X} p(x|y), \tag{21}$$

$$H(X|Y) = \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{1}{p(x|y)}, \tag{22}$$

where $p(y) = |D_y| / N$ and $p(x|y)$ can be simply estimated
from the observed frequencies in the categorized data,
resulting in an empirical estimation. $|D_y|$ represents the
number of images in one category.
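
Both metrics can be computed from the two label lists by counting, as in the following sketch of Equations (21) and (22). The natural logarithm is an assumption, since the text does not fix the base, and the function name is hypothetical.

```python
import math
from collections import Counter

def purity_and_conditional_entropy(true_labels, pred_labels):
    """Return (Purity(X|Y), H(X|Y)) from ground-truth and inferred labels."""
    n = len(true_labels)
    clusters = {}
    for x, y in zip(true_labels, pred_labels):
        # Count ground-truth labels inside each inferred category y.
        clusters.setdefault(y, Counter())[x] += 1
    purity, entropy = 0.0, 0.0
    for counts in clusters.values():
        size = sum(counts.values())            # |D_y|
        p_y = size / n                         # p(y) = |D_y| / N
        purity += p_y * max(counts.values()) / size
        # p(x|y) = c / size, so log(1 / p(x|y)) = log(size / c)
        entropy += p_y * sum((c / size) * math.log(size / c)
                             for c in counts.values())
    return purity, entropy
```

A perfect grouping yields a Purity of 1 and a Conditional Entropy of 0, regardless of how the inferred categories are numbered.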

6.2 Parameter settings and results 

We carry out the experiments on a PC with a Quad-Core
3.6GHz CPU and 32GB memory. We set the parameter
$\beta = 300$ in the probabilistic formulation (in Equation (8)),
and the parameter $\tau = 0.2$ in the probabilistic edge
definition (in Equation (5)).

In our experiments, we first randomly collect a
number of image patches with different scales from the
datasets and generate 500 ITWs and 500 HTWs as
introduced in Section 2. There are 1000 words in total in
the dictionary.

We run our method 10 times and use the average
performance for comparison. The inferred category
number may differ from run to run, as reported in
Table 1. The average category number is 9.0 for
MIT-Scene, 10.4 for Corel, 16.1 for UIUC-Scene, and 25.0
for the mixed dataset.

For comparison, several state-of-the-art approaches are
implemented based on the code released by the original
researchers, including pLSA [17], Affinity Propagation
(AP) [39], and LDA [40]. For the pLSA approach, we
extract color SIFT descriptors to construct a dictionary
of 1000 visual words, following the original
implementation. For the other two approaches, i.e., AP and
LDA, we use our image representations (i.e., the two types
of words extracted within the spatial pyramid) as the
inputs of the clustering algorithms. In addition, the
k-means clustering algorithm is adopted as the baseline,
with either our representations or the gradient-based
GIST features [6]. These methods use exactly the same
experimental settings as our approach for fair evaluation,
but their category number is manually fixed, i.e., 8 for
the MIT-Scene database, 10 for Corel, 15 for UIUC-Scene,
and 23 for the mixed dataset. The quantitative results are
reported in Table 2 and Table 3 for the two benchmark
metrics, respectively.

Fig. 4. The selected visual words for 15 categories of the UIUC-Scene database. For each category, we show the
top 40 informative visual words according to their information gains (the vertical axis). The different colors represent
different types of words (i.e., red for ITWs and blue for HTWs).

(a) Convergence comparison on the MIT database. (b) Convergence comparison on the Corel database. (c) Convergence comparison on the UIUC database. Horizontal axis in each panel: iteration number.

Fig. 5. Convergence comparisons of the CSWC algorithm and the original SWC algorithm. The experiments are executed on
the three databases: MIT-Scene in (a), Corel in (b), and UIUC-Scene in (c). In each chart, the horizontal axis and the
vertical axis represent the iteration step and the target energy ($-\log p(S|V)$), respectively. The dashed (green) curves
are from the original SWC algorithm and the solid (blue) curves are from the CSWC algorithm.


TABLE 2

Performance comparison via Purity (higher is better)

        K-means  GIST     pLSA     LDA      AP       Ours
                                                     ITW+HTW  ITW      HTW
MIT     0.5529   0.5770   0.6457   0.6096   0.5546   0.6721   0.5764   0.6000
Corel   0.5337   0.5644   0.6070   0.5980   0.5612   0.6203   0.6160   0.6040
UIUC    0.4487   0.4514   0.5074   0.5449   0.5850   0.5964   0.5613   0.5148
Mixed   0.3632   0.3801   0.4136   0.4801   0.5017   0.5295   0.4836   0.4226

Fig. 6. Time complexity analysis with increasing data scale. This analysis is performed on the mixture of the MIT
database and the Corel database. In each figure, the vertical axis represents the speed of convergence (in iteration steps);
the horizontal axis represents, in (a), the number of images with 16 fixed underlying categories, in (b), the number of
categories with a fixed number of images, and in (c), the number of images with various underlying categories.


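In general, our method (ITW+HTW) outperforms the other
compared approaches: it achieves the highest Purity and
the lowest Conditional Entropy on all four datasets. We
also evaluate our method with only one type of visual
words, i.e., either ITWs or HTWs. The combined
representation is better than either single type on every
dataset, which illustrates the benefit of combining the
two types of features.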

In our method, the clustering inference is performed
simultaneously with the feature selection for category
modeling. In Fig. 4, we show the selected visual words
of different types, i.e., ITWs and HTWs, for different
categories, and plot the coefficients of the top 40
informative words. The selected words match the
appearances of the images in each category well.

6.3 Analysis 

In the following, we conduct additional empirical
analysis to validate the advantages of our approach.

First, we analyze the convergence efficiency of the
CSWC algorithm and compare it with the original SWC
algorithm. Fig. 5 shows the convergence curves of the
target energy, i.e., $-\log P(S|V)$, over the iteration
steps. Note that the energy decreases as the posterior
probability increases. We can observe that the CSWC
algorithm converges significantly faster on all three
databases.

Moreover, we analyze the computational complexity
of our approach. The space complexity (i.e., computer
memory) mainly depends on the size of the visual
word dictionary and the number of images to be
categorized. Here we mainly discuss the time complexity,
which quantifies the amount of time taken by an
algorithm as a function of the asymptotic size of the input.
Using big O notation, which excludes coefficients
and lower-order terms, the theoretical time complexity of
our approach is $O(MKT)$, where $M$ is the number of
sampling steps, $K$ is the category number, and $T$ is the
average number of features selected for each category.
As discussed in Section 5, the generative model can
be pursued analytically by greedy feature selection, and
the feature responses on all images can be calculated
off-line. In addition, only a few (i.e., $< K$) categories
need to be updated in each iteration. Hence, we roughly
consider the time complexity to be determined by the
number of sampling steps. On the hardware mentioned
above, each iteration takes on average 0.043 s on
MIT-Scene, 0.015 s on Corel, and 0.052 s on UIUC-Scene.
In Fig. 6, we visualize the number of iteration steps
against two types of data scale: the total number of images
to be categorized and the underlying category number.
From the results, we can observe that the number of steps
grows at a non-exponential rate, which is consistent with
our analysis.

Finally, in order to reveal how much the vocabulary
size affects the results, we present an experiment in
Fig. 7, where the categorization results are reported with
different vocabulary sizes on the mixed dataset. We can
conclude that our approach is not sensitive to the
vocabulary size, as we incorporate the model learning
(i.e., feature selection) into the categorization. This
property enables us to avoid elaborately tuning the size
of the vocabulary in practice.


TABLE 3

Performance comparison via Conditional Entropy (lower is better)

        K-means  GIST     pLSA     LDA      AP       Ours
                                                     ITW+HTW  ITW      HTW
MIT     1.2465   1.2102   1.0156   1.1836   1.1400   0.8963   1.1536   1.1145
Corel   1.3105   1.2136   1.1234   1.1371   1.2577   1.0909   1.1036   1.1154
UIUC    1.5020   1.4564   1.4322   1.3146   1.2121   1.1581   1.2150   1.3948
Mixed   1.7603   1.7172   1.6811   1.5828   1.5127   1.4328   1.4971   1.5955

Fig. 7. The influence of vocabulary size. This analysis
is executed on the mixed database (of 23 categories).
The upper figure and the lower figure represent the
results via Purity (higher is better) and Conditional
Entropy, respectively. The horizontal axis represents the
vocabulary size. Note that we generate equal numbers of
the two types of words in these tests.


7 Conclusions 

This paper studies a general framework for automatically
discovering image categories via unsupervised
graph partition. Compared with previous methods,
the advantages of the proposed method are demonstrated
on several public datasets and are summarized as follows.
First, images are represented by two types of visual
words, ITWs and HTWs, which capture image appearances
from different aspects. Second, we perform feature
selection simultaneously with the clustering procedure,
guided by a generative model for each category. Third,
we employ a stochastic sampling algorithm for efficient
inference, in which the number of clusters is
automatically determined.